This study investigates whether item characteristic
curves are the same for black students as for white students in the
United States. The data analyzed were the answer sheets of 2269 black
students and 2285 white students taking the 85-item Verbal Section of
the College Board's Scholastic Aptitude Test. The study of item
characteristic curves is a feasible and fruitful way to investigate
item biases. It has definite advantages over less sophisticated
methods. More than a third of the 85 test items were found to have
different characteristic curves for blacks and for whites at the 5%
level of statistical significance. In many cases it is not clear from
reading an item why it is biased in a particular way.

A Study of Item Bias, Using Item Characteristic Curve Theory

Frederic M. Lord
Educational Testing Service
Princeton, New Jersey 08540, U.S.A.

I am going to report on a study of bias in test items. The study
compares data from about 2250 whites with data from an equal number of
blacks. Both groups are about 44 percent male. The test administered
is an 85-item verbal test used for college admissions--the Verbal Section
of the April 1975 Scholastic Aptitude Test of the College Entrance
Examination Board. There are four kinds of verbal items in the test:
verbal analogies, antonyms, word meaning, and reading comprehension.

Does this test measure the same thing for blacks as it does for
whites? Are there some items that should be removed from the test so
that the remaining items will measure appropriately in both groups? These
are the questions that we are trying to answer.

The general plan and design of the study was developed by Gary Marco,
Director, Statistical Analysis, College Board Programs Division, at
Educational Testing Service. Marco will be the senior author of the
final report of this study. The study is partially supported by the
CEEB. Before giving more details, I will talk about certain previous
approaches to the study of item bias, and also about item characteristic
curve theory, upon which the present study is based.

Figure 1 plots item difficulty for blacks against item difficulty for
whites. For the present, I use the term 'item difficulty' to refer to the
proportion of correct answers given to an item. The data used to obtain
Figure 1 are the same data already described. The 85 crosses in the
figure represent the 85 items in the verbal test. Items falling along
the dashed line in the figure are items that are as easy for blacks as
for whites. Items below this line are easier for whites. The solid
oblique line is a straight line fitted to the scatter of points. The
solid line differs from the dashed line because whites score higher on
the test than blacks. If all the items fell directly on the solid line, we
could say that the items are all equally biased or, conceivably, equally
unbiased.

It has been customary to look at the scatter of items about the solid
line and to pick out the items lying relatively far from the line and
consider them as atypical and undesirable. In the middle of Figure 1
there is one item lying far below the line that appears to be strongly
biased in favor of whites, also another item far above the line that
favors blacks much more than other items. A common judgment would be
that both of these items should be removed from the test.

In Figure 1 the standard error of a single proportion is about .01,
or less. Thus most of the scattering of points is not attributable to
sampling fluctuations. Unfortunately, the failure to fall along a
straight line is not necessarily attributable to differences among items
in bias. This is true for six different reasons, which I will discuss
next.
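
As a rough check on the figure of .01, the standard error of a sample
proportion p based on n examinees is sqrt(p(1 - p)/n), which is largest
at p = .50. The sketch below assumes n = 2250 per group (the approximate
group size given above) and simple random sampling; the function name is
illustrative only.

```python
import math

def proportion_se(p: float, n: int) -> float:
    """Standard error of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1.0 - p) / n)

# Assumed group size: about 2250 examinees per group, as stated in the text.
N_PER_GROUP = 2250

# The standard error is largest at p = .50 and shrinks toward p = 0 or 1.
for p in (0.10, 0.20, 0.50, 0.80, 0.90):
    print(f"p = {p:.2f}  SE = {proportion_se(p, N_PER_GROUP):.4f}")
```

Even at p = .50 the standard error stays close to .01, which is small
next to the scatter of points described for Figure 1.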

In the first place, we should expect the scatter in Figure 1 to fall
along a curved line, not a straight line. Logically, the curved line must
pass through the points (0,0) and (1,1). If the groups performed equally
well on the test, the points could fall along the dashed line; but since
one group performs better than the other, most of the points must lie
to one side of the dashed line and the relationship must be curved.

Careful studies attempt to avoid this curvature by transforming the
proportions. If an analysis of variance is to be done, the conventional
transformation is the arcsine transformation. The real purpose of the
arcsine transformation is to equalize sampling variance. Whatever
effect it may have in straightening the line of relationship is purely
incidental.
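
The variance-equalizing property can be seen in a small sketch. By the
delta method, the standard error of arcsin(sqrt(p)) is approximately
1/(2 sqrt(n)) whatever the value of p, whereas the standard error of p
itself changes with p. The group size of 2250 and the function names are
assumptions made for illustration.

```python
import math

def arcsine_transform(p: float) -> float:
    """Arcsine (angular) transformation of a proportion, in radians."""
    return math.asin(math.sqrt(p))

def raw_se(p: float, n: int) -> float:
    """Standard error of the untransformed proportion."""
    return math.sqrt(p * (1.0 - p) / n)

def arcsine_se(n: int) -> float:
    """Approximate (delta-method) standard error of arcsin(sqrt(p)).

    The approximation does not depend on p.
    """
    return 1.0 / (2.0 * math.sqrt(n))

n = 2250  # assumed group size
for p in (0.10, 0.50, 0.90):
    print(f"p = {p:.2f}  raw SE = {raw_se(p, n):.4f}  "
          f"transformed value = {arcsine_transform(p):.4f}  "
          f"transformed SE ~ {arcsine_se(n):.4f}")
```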

The transformation usually used to straighten the line of relationship
is the inverse normal transformation. The proportion of correct answers
is replaced by the relative deviate that would cut off the same proportion
of the area under the standard normal curve. The result of this
transformation is shown in Figure 2. Indeed, the points in Figure 2 fall about
a line that is more nearly straight than was the case in Figure 1.

Unfortunately, there are theoretical objections to the inverse normal
transformation. Suppose that the test were to contain several items
so difficult that everyone simply guessed at random on these items. Since
the items here are five-choice items, the proportions of correct answers
for both blacks and whites would be .20. This means that the curve in
Figure 2 should pass through the point (-.84, -.84). It again appears
that when there is guessing, the points in Figure 2 cannot be expected to
fall strictly along a straight line unless the two groups perform equally
well on the test.
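
The deviate for a chance-level proportion can be checked directly. The
sketch below uses the standard normal distribution from Python's
standard library and measures the cut-off area from the left, so that
proportions below .50 give negative deviates; the function name is
illustrative.

```python
from statistics import NormalDist

def normal_deviate(p: float) -> float:
    """Deviate cutting off proportion p of the area under the
    standard normal curve, measured from the left."""
    return NormalDist().inv_cdf(p)

# Chance-level proportion correct on a five-choice item under random guessing.
chance = 1.0 / 5.0
print(round(normal_deviate(chance), 2))  # -0.84
```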



Next, there is a reason why the items cannot be expected all to
fall along a single curve. If items at one level of discriminating
power fall along a certain curve, then items at a different level of
discriminating power will fall along a different curve. The reason is that
the more discriminating items would produce more difference between blacks
and whites than would the less discriminating items.

This last leads to the startling conclusion that the proportion
of correct answers really is not a measure of item difficulty! Let me
come back to this point in a moment.

Figure 3 shows some typical item characteristic curves. The scale
along the baseline represents the ability of the examinee. The item
characteristic curve shows the probability of a correct answer as a
function of examinee ability. The general shape of the curve follows
naturally from the fact that success on the item tends to increase with
ability, but the probability of success can never exceed 1.0, nor fall
below 0. Such curves typically have one point of inflection.

In item characteristic curve theory it is usually assumed that such
curves can be defined by three item parameters. The item difficulty b
represents the ability level corresponding to the point of inflection.
When there is no guessing, b is the ability level at which the examinee
has a 50 percent chance of answering the item correctly. The higher the
value of b, the more difficult the item.

The slope at the inflection point is proportional to the item parameter
a, which represents the discriminating power of the item. When
there is no guessing, the slope at the point of inflection under a
commonly used model is simply a/√(2π).

The item parameter c represents the probability of success for
examinees of infinitely low ability. Thus c defines the lower asymptote
of the item characteristic curve. It is nonzero whenever examinees can
guess the correct answer. Typically, but not always, c is less than the
chance level that would be achieved by an examinee guessing at random.
The reason is that test developers spend much effort and ingenuity
providing attractive distractors to the items, with the result that
people who do not know the answer typically do less well than if they
had chosen their responses at random.
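
A minimal sketch of a three-parameter item characteristic curve follows.
It assumes the normal-ogive form P(theta) = c + (1 - c) Phi(a(theta - b)),
one commonly used model under which the no-guessing slope at the
inflection point is a/sqrt(2 pi). The talk does not state which
functional form LOGIST fitted, and the parameter values are hypothetical.

```python
import math
from statistics import NormalDist

_PHI = NormalDist()  # standard normal distribution

def icc(theta: float, a: float, b: float, c: float = 0.0) -> float:
    """Probability of a correct answer at ability theta
    (normal-ogive form, an assumption of this sketch)."""
    return c + (1.0 - c) * _PHI.cdf(a * (theta - b))

def slope_at_inflection(a: float, c: float = 0.0) -> float:
    """Analytic slope of the curve at theta = b."""
    return (1.0 - c) * a / math.sqrt(2.0 * math.pi)

# Hypothetical item parameters, for illustration only.
a, b, c = 1.2, 0.5, 0.15
print(icc(b, a, b, 0.0))     # probability at theta = b with no guessing
print(icc(b, a, b, c))       # probability at theta = b with lower asymptote c
print(icc(-10.0, a, b, c))   # very low ability: close to the lower asymptote c
print(slope_at_inflection(a))
```

With c = 0 the curve passes through .50 at theta = b, and raising c
lifts the whole curve toward the asymptote without moving b.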

Figure 4 shows two rather different item characteristic curves; inverted
on the baseline are the distributions of ability for two different
groups of examinees. First of all you should note: The item difficulty
b should be the same regardless of the group from which it is
determined; the ability required for a certain level of performance by
an individual does not depend on the ability distribution of other people
in some group. The same holds true for the slope a at the inflection
point, and for the lower asymptote c. This invariance is the outstanding
advantage of the item parameters used in item characteristic curve theory.
In principle, within reasonable limits, the parameters should stay the
same regardless of the group tested.

Now please note carefully the following: In group A, item 1 is
answered correctly less often than item 2. In group B, the opposite
occurs. If we use the proportion of correct answers as a measure of item
difficulty, we find that item 1 is easier than item 2 for one group, but
harder than item 2 for the other group. It is for this reason that I assert
that proportion of correct answers in a group of examinees is not a measure
of item difficulty.

This proportion not only describes the test item but also
describes the group tested. This is a basic objection to analyzing item
bias by the approaches suggested by Figures 1 and 2.
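
The reversal described for Figure 4 can be reproduced numerically. The
sketch below is only an illustration: it assumes normal-ogive curves,
normally distributed ability in each group, and hypothetical item
parameters and group means. None of these values come from the study.

```python
from statistics import NormalDist

_PHI = NormalDist()

def icc(theta, a, b, c=0.0):
    """Normal-ogive item characteristic curve (assumed form)."""
    return c + (1.0 - c) * _PHI.cdf(a * (theta - b))

def proportion_correct(a, b, c, group_mean, group_sd, steps=2001):
    """Expected proportion correct in a normally distributed group,
    by trapezoidal integration of icc(theta) * density(theta)."""
    group = NormalDist(group_mean, group_sd)
    lo = group_mean - 6 * group_sd
    hi = group_mean + 6 * group_sd
    width = (hi - lo) / (steps - 1)
    total = 0.0
    for k in range(steps):
        theta = lo + k * width
        weight = 0.5 if k in (0, steps - 1) else 1.0
        total += weight * icc(theta, a, b, c) * group.pdf(theta)
    return total * width

# Hypothetical items: same b, very different discriminating power a.
item1 = dict(a=2.0, b=0.0, c=0.0)
item2 = dict(a=0.5, b=0.0, c=0.0)

for label, mean in (("group A", -1.0), ("group B", +1.0)):
    p1 = proportion_correct(**item1, group_mean=mean, group_sd=1.0)
    p2 = proportion_correct(**item2, group_mean=mean, group_sd=1.0)
    print(f"{label}: item 1 = {p1:.3f}, item 2 = {p2:.3f}")
```

The item parameters are identical in both groups, yet the ordering of
the two items by proportion correct changes with the group's ability
distribution.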

Still another difficulty with these conventional approaches may be
mentioned: The black group and the white group represented in Figure 1
are apparently not comparable in verbal skills. It might be argued that we
should base our analysis on white and black groups that are matched on
verbal skills. Such matching is difficult to carry out in practice,
however. We cannot properly match on a test composed of the items that
are to be studied, since this would introduce spurious relationships.
If we try to match on a parallel form of the same test, we will be
matching on a fallible score when we should be matching on a true score.
There will be a regression effect that will prevent proper matching.
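
The regression effect can be illustrated with Kelley's classical formula
for the expected true score given an observed score: the group mean plus
the reliability times the deviation of the observed score from that
mean. The reliability, group means, and matching score below are
hypothetical values chosen for illustration, not figures from this study.

```python
def expected_true_score(observed: float, group_mean: float,
                        reliability: float) -> float:
    """Kelley's regression estimate of the true score for a member
    of a group with the given mean and test reliability."""
    return group_mean + reliability * (observed - group_mean)

# Hypothetical values: two groups with different means on the matching test.
RELIABILITY = 0.90
MEAN_GROUP_1 = 350.0
MEAN_GROUP_2 = 450.0

matched_observed = 400.0  # both examinees are "matched" at this observed score
t1 = expected_true_score(matched_observed, MEAN_GROUP_1, RELIABILITY)
t2 = expected_true_score(matched_observed, MEAN_GROUP_2, RELIABILITY)
print(t1, t2)  # expected true scores differ although observed scores match
```

Each expected true score is pulled toward its own group mean, so
examinees matched on the fallible score are not matched on true score.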

One way to compare the performance of blacks and whites at the same
level of verbal skill is to compare the characteristic curve of an item
for blacks with the characteristic curve of the same item for whites.
Any difference between the curves indicates some kind of bias. This
comparison is made in the study I am reporting today.

Before proceeding let me note the following, however. Suppose, to
take an extreme example, certain items in a test are taught to one group
of students and not taught to another, while other items are taught to
both groups. This way of teaching increases the dimensionality of
whatever is measured by the test. If the items would otherwise have been
factorially unidimensional, this way of teaching will introduce additional
dimensions. If we ignore this and analyze all items as if they were
unidimensional, we cannot expect all item characteristic curves to be the
same for both groups. Since blacks and whites are exposed to different
learning environments, the situation may be quite similar for them. With
this in mind, let us turn to a report of the present study.

We used a computer program, LOGIST, which simultaneously estimates
the ability of each examinee and the a, b, and c parameters of
each item. The answer sheets of the 2250 whites and the answer sheets
of the 2250 blacks were first run separately on this program.

It is inherent in the nature of the problem that the origin and the
unit for measuring ability cannot be determined from the data. Thus the
item parameters from the black group cannot be compared directly with the
item parameters from the white group. To determine a common origin and
unit, we plotted the b parameters (item difficulties) for the black
group against the b parameters for the white group. The plot is shown
as Figure 5. This plot is the same as Figures 1 and 2 except that here
item difficulty is measured by the parameter b.

According to the icc model the values of b for blacks and for whites
can only differ in origin and unit of measurement. The straight line
fitted to the 85 points is the first principal axis. This line was used
to put all item parameters on the same scale.
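
The rescaling step might be sketched as follows. The talk does not give
the computation, so this uses the standard formula for the major axis of
a two-variable scatter together with the usual linear-scale relations
for item parameters (b shifts and stretches with the ability scale, a
changes inversely with the unit, c is unchanged). Which group served as
the reference scale, the function names, and the numbers are all
assumptions of this sketch.

```python
import math

def first_principal_axis(xs, ys):
    """Slope and intercept of the first principal axis (major axis)
    of a set of 2-D points."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / n
    syy = sum((y - my) ** 2 for y in ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    if sxy == 0:
        raise ValueError("axis is horizontal or vertical; slope not defined here")
    slope = ((syy - sxx) + math.sqrt((syy - sxx) ** 2 + 4.0 * sxy ** 2)) / (2.0 * sxy)
    return slope, my - slope * mx

def to_reference_scale(a, b, slope, intercept):
    """Convert one item's (a, b) from the second group's scale to the
    reference scale, given b_second = slope * b_reference + intercept."""
    return a * slope, (b - intercept) / slope

# Hypothetical difficulties lying exactly on b_second = 2 * b_ref + 1.
b_ref = [-1.0, 0.0, 1.0, 2.0]
b_second = [-1.0, 1.0, 3.0, 5.0]
slope, intercept = first_principal_axis(b_ref, b_second)
print(slope, intercept)
print(to_reference_scale(0.6, 3.0, slope, intercept))
```

Unlike an ordinary regression line, the principal axis treats the two
sets of difficulties symmetrically, which suits a case where both are
estimates subject to error.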

We can now test the null hypothesis that a particular item has the
same item characteristic curve for blacks and for whites.* The asymptotic
significance test used will be discussed in a moment. Forty-six of the 85
items were found to be significantly different at the five percent level.

The study could have been stopped at this point. However, it might
be argued that a test composed of so many biased items did not provide
an adequate basis for measuring examinee ability. To meet this objection,
the items showing significant differences beyond the 15 percent level
were eliminated, leaving 32 items for which the black and white item
characteristic curves were very similar.

The black and white groups were now combined and the data for the
32-item test run on LOGIST, ignoring color differences. In this way, the
ability parameters of blacks and whites on the 32-item test were all
estimated on the same scale.

As a final step, the entire first step of the study was repeated, now
treating the ability parameters just estimated as given. Since the
ability parameters are all on the same scale, the item parameters
obtained for the black group are now comparable with the item parameters
obtained for the white group.

Asymptotic significance tests were again carried out to test the null
hypothesis that for a given item the black and the white item characteristic
curves are identical. This hypothesis was rejected at the 5 percent level
for 38 items out of 85. The distribution of the 85 items over different
significance levels is shown in Table 1.

*Actually, in order to make a significance test possible, the value
of the c parameter for an item was required to be the same for blacks
and for whites. Thus the curves could only differ in a and b
parameters. This complication is glossed over here but will be fully covered
in the final report.

I should now discuss the rationale for the significance tests.
Actually, it is not presently possible to specify with certainty even
the asymptotic standard error of the maximum likelihood estimates used
in this study. An approximation to these standard errors, based on
certain reasonable assumptions, was used. Rather than trying to justify the
approximation mathematically, it may be more satisfactory to justify
it by the results of an empirical study carried out especially for this
purpose, as follows.

The black and white groups were combined into a total group of
about 4500 individuals. This total group was divided at random into
two groups, which we may designate 'blue' and 'red.' The entire
statistical analysis, involving at least three LOGIST runs, was repeated
for these two random groups. At the end, asymptotic significance tests
were carried out to test the null hypothesis that the blue item characteristic
curves were the same as the red. The distribution of the 85 test items
over various significance levels is shown in Table 2.

Since the blue and the red groups were drawn at random, the 85 items
should be rectangularly distributed over the range of significance levels.
This would mean just eight and one-half items in each probability interval
of width .10. The frequencies shown in Table 2 are surprisingly close to
this, suggesting that the statistical procedure used is actually a good
approximation. When we compare Tables 1 and 2, it seems that a third
or more of the items really have different characteristic curves for
blacks and for whites.
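
The rectangular-distribution check can be sketched as a count of
significance levels by interval, compared with equal expected counts.
The Table 2 frequencies are not reproduced in this text, so the
significance levels below are made-up values for demonstration only, and
the chi-square summary is an added convenience rather than a test
reported in the talk.

```python
def bin_counts(p_values, n_bins=10):
    """Count p-values falling in equal-width probability intervals on [0, 1]."""
    counts = [0] * n_bins
    for p in p_values:
        k = min(int(p * n_bins), n_bins - 1)  # p = 1.0 goes in the last interval
        counts[k] += 1
    return counts

def chi_square_uniform(counts):
    """Pearson chi-square statistic against equal expected counts."""
    expected = sum(counts) / len(counts)
    return sum((obs - expected) ** 2 / expected for obs in counts)

N_ITEMS = 85
print(N_ITEMS / 10)  # expected items per interval of width .10: 8.5

# Hypothetical p-values for demonstration only (not the Table 2 data).
demo = [0.03, 0.12, 0.18, 0.27, 0.35, 0.44, 0.52, 0.61, 0.77, 0.95]
counts = bin_counts(demo)
print(counts, chi_square_uniform(counts))
```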

Figure 6 illustrates one such curve--the curve for item 71. The
baseline of the figure represents examinee ability over the range from
-4 standard deviations to +3 standard deviations. The vertical axis
shows the probability of a correct answer to item 71. The dashed curve
is the icc for blacks; the solid curve is the icc for whites. At the
extremes of each group, individuals are shown as points, in order to give
an idea of where the data lie. In the middle of the curve, where most of
the data lie, individual points are not shown. It should be remembered
that a particular individual in practice answers an item either
correctly or incorrectly. We do not actually observe a probability for a
single individual.

The two curves for item 71 are significantly different beyond the
.01 level. Interestingly, high-level white students do better than
high-level black students on this item, but low-level black students do better
than low-level white students.

A similar situation appears in the next figure, which shows the results
for item 2. In addition, we find that item 2 does not discriminate among
black students but does discriminate among white students.

Item 71 and item 2 illustrate a kind of difference that would not be
found by the techniques shown in Figure 1 and Figure 2. Each of these
items falls along the curve of relation in Figures 1 and 2 and does not
appear to be more difficult over all for one group or the other.
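
Why crossing curves can escape a comparison of overall proportions may
be seen in a small sketch. It assumes normal-ogive curves and purely
hypothetical parameters for one item estimated in two groups (same b and
c, different a); these are not the estimates for item 71 or item 2.

```python
from statistics import NormalDist

_PHI = NormalDist()

def icc(theta, a, b, c=0.0):
    """Normal-ogive item characteristic curve (assumed form)."""
    return c + (1.0 - c) * _PHI.cdf(a * (theta - b))

# Hypothetical parameters for one item estimated separately in two groups:
# same difficulty b and lower asymptote c, different discriminating power a.
steep = dict(a=1.5, b=0.0, c=0.2)
flat = dict(a=0.4, b=0.0, c=0.2)

for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
    diff = icc(theta, **steep) - icc(theta, **flat)
    print(f"theta = {theta:+.1f}  steep - flat = {diff:+.3f}")
```

The difference is negative below the crossing point and positive above
it, so the two signs can largely cancel when performance is summarized
by a single proportion correct for each group.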



The last figure shows the item characteristic curves for item 24,
which is a very difficult item. Regardless of ability levels, black
students are unsuccessful on this item. For white students, however,
the item does discriminate at very high ability levels.

I have studied the items in the test and compared them with the
statistical results without reaching any startling insight into the
reasons for the special biases of individual items. Unfortunately, I
cannot hand out a copy of all the test questions together with the table
of the statistical results for you to study. The reason is that the
items we have analyzed are still in our active item pool for use in
building new college admissions tests. The items together with the past
statistical analyses are expensive, and the confidentiality of the items
must be maintained. I have permission to read to you the three items
represented by the last three illustrations. Perhaps you will see
some clear explanation for the statistical results.

71. A deficiency of calories means a shortage of the
    supply of calories to the body in relation to the
    ----- them.

    (A) production of
    (B) variations between
    (C) assessment of
    (D) requirement for
    (E) connections among

 2. INJURE: (A) release (B) refrain
    (C) smooth (D) embellish (E) heal

24. We do not have a full grasp on experience until
    we have symbolized it; we cannot ----- until we
    have -----.

    (A) understand . . learned
    (B) communicate . . thought
    (C) inform . . revised
    (D) explain . . hypothesized
    (E) know . . verbalized

The final report of this study will include not only the material
I have presented here, but also, for comparison, the statistical analysis
of the same data by the method illustrated in Figure 2. A more thorough
study of the items at that time may reveal more clearly the reasons for
the biases shown.

Does the test measure the same psychological trait for blacks as for
whites? If it measured totally different traits for blacks and for
whites, the scatterplot in Figure 5 would show little or no relationship
between the item difficulty indices for the two groups.

In view of this, the study shows that the test does measure
approximately the same skill for blacks and whites. Some items show up
differently in the two groups, but the differences are rather small.

The item characteristic curve techniques used here can pick out
certain atypical items that should be cut out from the test. It is to
be hoped that more careful study of such analyses will help us understand
better why certain items are biased, why certain groups of people respond
differently than others on certain items, and what can be done about
this.
Fig. 5. Difficulty parameters (b)
for 85 items for blacks and for whites.

Fig. 6. Black (dashed) and white (solid)
characteristic curves for item 71.

Fig. 7. Characteristic curves for item 2.